- Diabetes is a prevalent disease worldwide
- Lots of diabetes-related data is available
- Motivation: Use data to help understand which factors contribute to being diagnosed
2022-05-08
National Health and Nutrition Examination Survey data concerning glycohemoglobin levels and diabetes mellitus (DM) from https://hbiostat.org/data/.
Why this dataset?
| Variable | Description | Units | Levels |
|---|---|---|---|
| seqn | Unique patient ID | | |
| sex | Gender | | 0, 1 |
| age | Age | Years | 12 - 80 |
| re | Race/ethnicity | | 5 levels: White, Black, Mexican, Other Hispanic, Other |
| income | Family income level | $ | 14 levels from 0 - 100000 |
| tx | On Insulin or Diabetes meds | | 0, 1 |
| dx | Diagnosed with DM or pre-DM | | 0, 1 |
| wt | Weight | kg | 28 - 239.4 |
| ht | Height | cm | 123.3 - 202.7 |
| bmi | Body-mass index | kg/m^2 | 13.18 - 84.87 |
| leg | Upper leg length | cm | 20.4 - 50.6 |
| arml | Upper arm length | cm | 24.8 - 47 |
| armc | Arm circumference | cm | 16.8 - 61 |
| waist | Waist circumference | cm | 52 - 179 |
| tri | Triceps skinfold thickness | mm | 2.6 - 41.1 |
| sub | Subscapular skinfold thickness | mm | 3.8 - 40.4 |
| gh | Glycohemoglobin | % | 4 - 16.4 |
| albumin | Albumin | g/dL | 2.5 - 5.3 |
| bun | Blood urea nitrogen | mg/dL | 1 - 90 |
| SCr | Serum Creatinine | mg/dL | 0.14 - 15.66 |
The dx variable does not differentiate between type 1 and type 2 diabetes
| Variable | Description | Units | Levels |
|---|---|---|---|
| income | Family income level | $ | 14 levels from 0 - 100000 |
Missing income values were replaced with the mean of all non-NA income values
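The mean-imputation step can be sketched as follows. This is a minimal illustration in plain Python, not the code used for the analysis; the function name `impute_mean` and the income values are made up for the example, and missing values are represented as `None`.

```python
from statistics import mean


def impute_mean(values):
    """Replace missing (None) entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)  # mean over non-missing values only
    return [fill if v is None else v for v in values]


# Hypothetical income values, two of them missing
income = [25000, None, 75000, 50000, None]
imputed = impute_mean(income)
print(imputed)
```

Mean imputation keeps the column mean unchanged but shrinks its variance, which is worth keeping in mind when many values are missing.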
| Variable | Description | Units | Levels |
|---|---|---|---|
| leg | Upper leg length | cm | 20.4 - 50.6 |
| arml | Upper arm length | cm | 24.8 - 47 |
| armc | Arm circumference | cm | 16.8 - 61 |
| waist | Waist circumference | cm | 52 - 179 |
| tri | Triceps skinfold thickness | mm | 2.6 - 41.1 |
| sub | Subscapular skinfold thickness | mm | 3.8 - 40.4 |
Here we implemented k-nearest neighbors (KNN, K = 5) in tidyverse. K was not optimized
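The slide does not state what KNN was used for. Assuming, by analogy with the income slide, that it filled in missing body measurements, the idea can be sketched as below. This is a plain-Python illustration rather than the original tidyverse implementation; `knn_impute`, the choice of `ht` and `wt` as predictors, and all data values are hypothetical.

```python
import math


def knn_impute(rows, target, predictors, k=5):
    """Fill missing `target` values with the mean over the k nearest complete rows.

    Distances are Euclidean on the predictor columns, which are assumed
    to be complete. Predictors are not rescaled in this sketch.
    """
    complete = [r for r in rows if r[target] is not None]
    result = []
    for r in rows:
        filled = dict(r)
        if r[target] is None:
            point = [r[p] for p in predictors]
            # Sort complete rows by distance to this row and keep the k closest
            nearest = sorted(
                complete,
                key=lambda c: math.dist(point, [c[p] for p in predictors]),
            )[:k]
            filled[target] = sum(n[target] for n in nearest) / len(nearest)
        result.append(filled)
    return result


# Made-up records: the last one is missing its upper leg length
rows = [
    {"ht": 150, "wt": 50, "leg": 34.0},
    {"ht": 160, "wt": 60, "leg": 37.0},
    {"ht": 170, "wt": 70, "leg": 40.0},
    {"ht": 180, "wt": 80, "leg": 43.0},
    {"ht": 190, "wt": 90, "leg": 46.0},
    {"ht": 200, "wt": 100, "leg": 49.0},
    {"ht": 172, "wt": 71, "leg": None},
]
filled = knn_impute(rows, target="leg", predictors=["ht", "wt"], k=5)
print(filled[-1]["leg"])
```

In practice the predictors would usually be standardized first so that no single variable dominates the distance, and K would be chosen by cross-validation.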
Biochemical variables have more outliers
| Variable | Description | Units | Levels |
|---|---|---|---|
| SCr | Serum Creatinine | mg/dL | 0.14 - 15.66 |
The normal range is 0.6 - 1.2 mg/dL; values of 5 mg/dL or more indicate severe kidney impairment. We removed all values above 5 mg/dL (17 values in total). Source: https://www.medicinenet.com/creatinine_blood_test/article.htm
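A minimal sketch of this outlier filter, in plain Python for illustration. The slide does not say whether whole records were dropped or only the SCr value was set to missing; this version assumes the records were dropped, and it keeps records whose SCr is missing. The function name and data values are hypothetical.

```python
SCR_LIMIT = 5.0  # mg/dL, the cutoff stated on the slide


def drop_high_scr(rows, limit=SCR_LIMIT):
    """Keep records whose serum creatinine is missing or at most `limit`."""
    return [r for r in rows if r["SCr"] is None or r["SCr"] <= limit]


# Made-up records; seqn is the patient ID
rows = [
    {"seqn": 1, "SCr": 0.9},
    {"seqn": 2, "SCr": 5.0},
    {"seqn": 3, "SCr": 5.1},
    {"seqn": 4, "SCr": 15.66},
    {"seqn": 5, "SCr": None},
]
kept = drop_high_scr(rows)
print([r["seqn"] for r in kept])
```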
Positive correlations primarily between body-size related variables
Weight and obesity levels as a contributing factor to diagnosis
Age as a contributing factor to diagnosis across BMI class
Treatment status of different ethnicity and age
Serum albumin levels in relation to diagnosis
Serum albumin is lower in diagnosed than in non-diagnosed individuals
Investigation of patterns concerning diagnosis and treatment of diabetes mellitus
Variables dx, tx, leg, arml, wt and ht were excluded
Clusters between age and all other variables
Single-parameter evaluation to establish a baseline
The precision of BMI and GH alone is 29% and 68%, respectively
The precision of this model is 80%
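For reference, precision is the share of predicted positives that are truly positive, TP / (TP + FP). The sketch below computes it for a single-parameter baseline that flags a patient when GH reaches a threshold. It is a plain-Python illustration only: the slides do not describe how the classifiers were built, the 6.5% cutoff is an assumed value, and the data are made up.

```python
def precision(y_true, y_pred):
    """Return TP / (TP + FP), or None if nothing was predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    if tp + fp == 0:
        return None
    return tp / (tp + fp)


def threshold_classifier(values, cutoff):
    """Predict 1 when the value is at or above the cutoff, else 0."""
    return [1 if v >= cutoff else 0 for v in values]


# Made-up glycohemoglobin values (%) and diagnosis labels
gh = [5.0, 5.6, 6.6, 7.2, 6.8, 5.2]
dx = [0, 0, 1, 1, 0, 1]

pred = threshold_classifier(gh, cutoff=6.5)  # assumed cutoff
p = precision(dx, pred)
print(p)
```

Precision alone ignores missed cases (false negatives), so it is usually reported together with recall.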
In summary: We cannot cluster patients based on these variables alone
Idea for further research: it appears that older people with diabetes tend to be treated more often than younger people with diabetes